Add pure julia exp function #19831
Conversation
function exp{T<:Union{Float32,Float64}}(x::T)
xu = reinterpret(Unsigned, x)
xs = xu & ~sign_mask(T)
xsb = Int(xu >> Unsigned(sizeof(T)*8 - 1))
Probably just as efficient to use signbit
Maybe I should use `signbit` instead.
ya, would be clearer
using Base: sign_mask, exponent_mask, exponent_one, exponent_bias,
No need to import any of these since we don't extend them
i686 error is the usual failing error and is unrelated
@@ -70,11 +71,23 @@ end

# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
macro horner(x, p...)
@gensym val
Why is this necessary? Won't the macro hygiene automatically convert `t` into a unique name?
Yes, good point. I think this was an issue in an earlier version of Julia, but in 0.6 macro hygiene converts it into a unique name.
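(For readers following along, a simplified sketch of what a `@horner`-style macro does; the real Base macro differs in details, and this is exactly where hygiene helps — the `t` below is renamed to a unique symbol automatically, so no explicit `@gensym` is needed:)

```julia
# Simplified sketch of @horner: expands to nested muladd calls.
macro horner(x, p...)
    ex = esc(p[end])
    for i = length(p)-1:-1:1
        # hygiene renames `t` to a unique symbol consistently
        ex = :(muladd(t, $ex, $(esc(p[i]))))
    end
    Expr(:block, :(t = $(esc(x))), ex)
end

@horner(0.5, 1.0, 2.0, 3.0)   # 1.0 + 0.5*(2.0 + 0.5*3.0) == 2.75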
MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150

# coefficients from: lib/msun/src/e_exp.c
@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
`x::Float64`?
# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
# and convert coefficients to same type as x
macro horner_oftype(x, p...)
Why is this macro needed? If you change the floating-point type, you generally will change the degree of the polynomial as well, and you can just use literals of the correct type, no? e.g. it looks like you only use it for `Float64` and `Float32` below, and Julia has a compact syntax for `Float32` literals (append `f0`).
I agree that it will generally change the degree of the polynomial. That said, one case where it doesn't really change is if you also want native Float16 support (although on CPUs this is useless): dispatching to the same polynomial (the one for `::Float32`) works well in practice (from many cases on different functions, I found it's really hard to find good coefficients by reducing the degree from the Float32 version for Float16). For example, this function will work just as well within accuracy for `Float16`. Thus it's helpful to define this macro to take care of such dispatch cases.
I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement (probably on CPUs we will just compute `exp(x::Float16)` by converting to/from Float32, no?).
I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement
Not really unusual for Float16 (as opposed to going from Float128 -> Float64). You have a lot less precision, so it's very hard to find an even smaller degree polynomial that retains the same accuracy constraint (< 1 ulp). This is the hardest/most time-consuming part: testing accuracy and finding the polynomial coefficients.
(probably on CPUs we will just compute exp(x::Float16) by converting to/from Float32, no?)
yes
I will remove it anyways
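For reference, the idea behind the removed macro can be sketched as follows (a hypothetical simplified version, not the PR's exact code): the same Horner expansion, but with each coefficient converted to the argument's type, so one coefficient list can serve several float types via dispatch.

```julia
# Simplified sketch of @horner_oftype: Horner's rule with each
# coefficient converted to the type of x.
macro horner_oftype(x, p...)
    ex = :(oftype(t, $(esc(p[end]))))
    for i = length(p)-1:-1:1
        ex = :(muladd(t, $ex, oftype(t, $(esc(p[i])))))
    end
    Expr(:block, :(t = $(esc(x))), ex)
end

# A Float16 argument reuses the Float64 literals, converted on the fly:
# @horner_oftype(Float16(0.5), 1.0, 2.0, 3.0) isa Float16
```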
MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150

# coefficients from: lib/msun/src/e_exp.c
@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
Why not just `exp_kernel(x::Float64)`?
Originally my motivation was to have two or more polynomials depending on the float type and use dispatch as follows:
- A low degree polynomial for Float16 and Float32
- A higher degree polynomial for Float64 and other floats like Float128
So in: https://github.com/JuliaMath/Amal.jl/blob/master/src/exp.jl
I have the two polynomial kernels dispatch on the following types:
SmallFloat = Union{Float16,Float32}
LargeFloat = Float64 # perhaps Union{Float64,Float128} in the future
That grand goal might be a pipe dream. So if you all prefer, I can restrict the types instead of using this and the `@horner_oftype` machinery (although it does have some elegance to have the current machinery with `horner_oftype` and the polynomial dispatch on the type).
I would prefer eliminating the extra macro for now.
(Wouldn't the `Float16` computations be faster in `Float32` arithmetic anyway? And wouldn't we need a higher degree for `Float128` than for `Float64`?)
Yep, it would be more efficient on the CPU. The motivation was for GPU code, where you do have native Float16 hardware and there is a real benefit (there is some very nice work coming along in having better Float16 support). For `Float128` I have not tested yet, but I suspect so.
I guess I can revisit when the time comes and restrict the polynomial dispatch and the additional macro.
# Float64, which is well within the allowable error. however,
# 2.718281828459045 is closer to the true value so we prefer that answer,
# given that 1.0 is such an important argument value.
if x == T(1.0) && T == Float64
Maybe `x == T(1.0) && return T(2.718281828459045235360)` so that we get the exactly rounded result for `Float32` too?
`Float32` is exactly rounded without this. So it's better to do this to remove the extra check for the Float32 version.
Okay.
What's the difference between `T == Float64` and `T === Float64`? I'm leaning towards using the latter, but for no particular reason.
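(Editor's note, not from the thread: for type objects the two are equivalent, because `==` has no special method for types and falls back to `===`, i.e. object identity. A quick check:)

```julia
# == on DataType arguments falls back to ===, so both are identity checks.
Float64 == Float64    # true
Float32 == Float64    # false
Float64 === Float64   # true: types are singleton objects
```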
end
else
n = round(T(LOG2E)*x)
k = unsafe_trunc(Int,n)
It seems bad that this will behave differently on 32-bit and 64-bit machines?
No, it doesn't matter in this case. `k` is the integer representing the exponent of the float, so there is no way it will cause issues, since the float will overflow way past the point where the `Int32` would. We could all the same just use `Int32`, but I think it might be nicer to just use the machine-size `Int`?
Better to use `Int32`, I think. (We could even use `Int16`, it sounds like, but there's probably no advantage to that.)
I made the change, but I don't fully understand why we should prefer `Int32` over `Int`. I tried both and it's not like there is any critical difference in the resulting code.
Would be good to have test cases that exercise each of the branches, including test cases just above and just below the various thresholds like
Out of curiosity, why is it so much faster than the openlibm version? More inlining?
I have carefully tested this function here: https://github.com/JuliaMath/Amal.jl/tree/master/test
Hard for me to tell without seeing something like
@musm, I don't doubt that you have tested your code, but we should have the key tests in our test suite, at the very least to make sure that we don't get regressions in the future.
Addressed some of the review comments. I added a function:
fpintsize(::Type{Float64}) = UInt64
fpintsize(::Type{Float32}) = UInt32
This would make this and one of the other float functions cleaner:
reinterpret(T, rem(Int32(32), fpintsize(T)))
# vs. what we use now
xu = reinterpret(Unsigned, x)
reinterpret(T, rem(Int32(32), typeof(xu)))
Need to still add tests and benchmarks
# integer size of float
fpintsize(::Type{Float64}) = UInt64
fpintsize(::Type{Float32}) = UInt32
fpintsize(::Type{Float16}) = UInt16
`fpintsize` sounds like it should return a number (`64` or `8`)... maybe `fpint` or `fpinttype`?
`fpinttype` sounds good to me
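(A sketch of how the renamed helper might be used generically; the `signbit_raw` example below is hypothetical, not from the PR, using the 0.6-era parametric syntax the PR itself uses:)

```julia
# The unsigned integer type with the same bit width as each IEEE float
# type, enabling generic bit-twiddling code.
fpinttype(::Type{Float64}) = UInt64
fpinttype(::Type{Float32}) = UInt32
fpinttype(::Type{Float16}) = UInt16

# e.g. extract the raw sign bit generically:
signbit_raw{T<:AbstractFloat}(x::T) =
    reinterpret(fpinttype(T), x) >> (8*sizeof(T) - 1)
```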
@pure exponent_raw_max{T<:AbstractFloat}(::Type{T}) = Int(exponent_mask(T) >> significand_bits(T))



need to get rid of all these additional white spaces :)
-1.65339022054652515390e-6, 4.13813679705723846039e-8)

# coefficients from: lib/msun/src/e_expf.c
@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)
(This is such a low degree that I wonder if we'd be better off using a minimax polynomial rather than a minimax rational function, at least for the single-precision case — the optimal polynomial would require a higher degree, but it would avoid the division. I don't think we need to address that possibility in this PR, however, since this is no worse than what we are doing now.)
This is a good question. I have already played with many variations, and my conclusion is that the msun version is the best of all the things I have tested, considering both accuracy (critical goal of being < 1 ulp) and performance.
Here's the problem with the minimax polynomial approach for the exp function, on my test suite for Float32:
FMA system, rational:        max 0.86076152 at x = -5.85780, mean 0.03712445
non-FMA system, rational:    max 0.86076152 at x = -5.85780, mean 0.03712434
FMA system, polynomial:      max 0.84459400 at x = -48.8587, mean 0.03627487
non-FMA system, polynomial:  max 1.09986448 at x = 5.23060,  mean 0.03798720
over the range: vcat(-10:0.0002:10, -1000:0.001:1000, -120:0.0023:1000, -1000:0.02:2000)
(errors may be larger, since we haven't tested all floats)
As you can see, it's not possible to be under 1 ulp on non-FMA systems using this polynomial (below).
What about speed? Well, I tested 3 hot branches, and the polynomial version on an FMA system is ~0.5-0.8 ns slower.
For reference, this is the best minimax polynomial I can find for Float32:
@inline exp_kernel{T<:SmallFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.1666666567325592041015625,
    4.1666455566883087158203125e-2,
    8.333526551723480224609375e-3,
    1.39357591979205608367919921875e-3,
    1.97799992747604846954345703125e-4)
Note it's not as simple as adding more terms to this polynomial to improve the accuracy; this is the minimal minimax polynomial I was able to find that meets the < 1 ulp constraint.
For Float64:
@inline exp_kernel{T<:LargeFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.16666666666666685170383743752609007060527801513672,
    4.1666666666666692109277647659837384708225727081299e-2,
    8.3333333333159547579027659480743750464171171188354e-3,
    1.38888888888693412537733706813014578074216842651367e-3,
    1.9841269898657093212653024227876130680670030415058e-4,
    2.4801587357008890921336585755341275216778740286827e-5,
    2.7557232875898009206386968239499424271343741565943e-6,
    2.7557245320026768203034231441428403286408865824342e-7,
    2.51126540120060271373185023340013355408473216812126e-8,
    2.0923712382298872819985862227861600493028504388349e-9)
I haven't tested this in a while, but the same conclusion holds.
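(A minimal sketch of the kind of ulp-error measurement these numbers come from — my reconstruction, not the actual Amal.jl test code: compare the Float32 result against a high-precision BigFloat reference and express the error in units in the last place.)

```julia
# Error of exp(x::Float32) in ulps, measured against a BigFloat reference.
function exp_ulp_error(x::Float32)
    y = exp(x)                                 # value under test
    yref = exp(big(x))                         # high-precision reference
    Float64(abs(big(y) - yref) / big(eps(y)))  # distance in units of eps at y
end

# e.g. worst error over a coarse grid:
maximum(exp_ulp_error, -10f0:0.25f0:10f0)
```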
@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)

# custom coefficients (slightly better mean accuracy, but also a bit slower)
# @inline exp_kernel(x::Float32) = @horner_oftype(x, 0.1666666567325592041015625f0,
I need to edit this to `@horner` or delete the comment entirely.
@@ -0,0 +1,137 @@
# Based on FreeBSD lib/msun/src/e_exp.c
# ====================================================
# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
need to add exceptions to the license file and contrib/add_license_to_files.jl
I don't really know what you mean. What do I need to do?
I looked here, for example:
https://github.com/JuliaLang/openlibm/blob/master/LICENSE.md
This part is kinda relevant
OpenLibm
OpenLibm contains code that is covered by various licenses.
The OpenLibm code derives from the FreeBSD msun and OpenBSD libm implementations, which in turn derives from FDLIBM 5.3. As a result, it has a number of fixes and updates that have accumulated over the years in msun, and also optimized assembly versions of many functions. These improvements are provided under the BSD and ISC licenses. The msun library also includes work placed under the public domain, which is noted in the individual files. Further work on making a standalone OpenLibm library from msun, as part of the Julia project is covered under the MIT license. The test files, test-double.c and test-float.c are under the LGPL.
Do I just need to add a link to LICENSE.md to openlibm (it's already there) and say that was used as a reference or something?
The Julia license file needs to have an exception here, and the script that adds license headers needs to add an exception for this file. It's derived code so should usually retain its original licensing
Is there an example I can copy to do this?
So I should add this line
- base/special/exp.jl (see [FREEBSD MSUN](https://github.com/freebsd/freebsd) [FreeBSD/2-clause BSD/Simplified BSD License])
below
The following components of Julia's standard library have separate licenses:
even if it's derived code?
# ====================================================
# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
# to use, copy, modify, and distribute this software is freely granted,
# provided that this notice is preserved.
# ====================================================
do I even need to include this? We are distributing the original file.
Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?
I'm not sure. Parts of this include ported and derived code?
Copyright Sun Microsystems, permission to modify and distribute granted provided this notice is preserved. So it's not MIT-licensed, copyright JuliaLang contributors, like the majority of the rest of Base. We're not distributing the original file in this repo under our MIT license.
Ok. I added the commit to address this : 46058d3
You only did half of it.
Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?
Yes. "exceptions to the license file and contrib/add_license_to_files.jl"
It is very cool that it is possible to do this in julia.
Running nanosoldier after JuliaCI/BaseBenchmarks.jl#58 should be an interesting test, since there are quite a few other files in BaseBenchmarks that use the
Ok so the benchmarks have been merged.
Should we trigger nanosoldier now?
@nanosoldier Also this should be squashed before committing it to master.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Performance looks good; the regressions are probably noise. But I don't think nanosoldier was running the new exp benchmarks?
Can we retrigger the benchmarks? They were just uploaded to nanosoldier. Thanks
@nanosoldier
# compute approximation
z = r*r
p = r - z*exp_kernel(z)
y = T(1.0) - ((lo - (r*p)/(T(2.0) - p)) - hi) # compute polynomial on reduced arugment
argument
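(For context, an editor's sketch of the overall msun-style scheme the lines above fit into — simplified, without the hi/lo split of log(2) or the rational kernel, and with a low-order Taylor series standing in for `exp_kernel`:)

```julia
const LOG2E = 1.4426950408889634   # 1/log(2)

# Sketch: exp(x) = 2^k * exp(r), where r = x - k*log(2), |r| <= log(2)/2
function exp_sketch(x::Float64)
    n = round(LOG2E * x)            # nearest integer multiple of log(2)
    k = unsafe_trunc(Int32, n)
    r = muladd(n, -log(2.0), x)     # single-step reduction (no hi/lo split)
    small = 1 + r*(1 + r*(1/2 + r*(1/6 + r/24)))  # crude kernel stand-in
    return ldexp(small, Int(k))     # scale by 2^k
end
```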
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Force-pushed from bd4945c to 7323b8f
Squashed and g2g on my end. The regressions don't make sense to me, since those files don't even use 'exp'. Not sure why we also don't see more improvements, as I have locally benchmarked other branches in exp, but I think this is just some resolution/noise issue on nanosoldier.
bump
* Add pure julia exp function
* Add a few more ldexp tests
Need to add benchmarks, but it is faster than or the same speed as the openlibm version: between 1.0x (same speed) and 1.15x to 1.3x to 1.5x faster, depending on which branch you test (testing using BenchmarkTools). On most branches it's faster; I think there is only one branch where it's basically the same speed. Accuracy is basically the same (both are based on the msun version).